Localizing Policy Gradient Estimates to Action Transitions
Authors
Abstract
Function Approximation (FA) representations of the state-action value function Q have been proposed in order to reduce variance in performance gradient estimates, and thereby improve performance of Policy Gradient (PG) reinforcement learning in large continuous domains (e.g., the PIFA algorithm of Sutton et al. (in press)). We show empirically that although PIFA converges significantly faster than traditional PG algorithms such as REINFORCE, which directly sample Q (without using FA), FA representations of Q are not necessary to reduce variance in performance gradient estimates, and PG algorithms that use selective direct samples of Q can converge orders of magnitude faster than PIFA. We present a new PG algorithm, called Action Transition Policy Gradient (ATPG), which uses direct samples of Q and restricts estimates of the gradient to coincide with action transitions, thus obtaining relative value estimates of executing actions without using FA representations of Q. We prove that ATPG gives an unbiased estimate of the performance gradient and converges to an optimal policy under piecewise continuity conditions on the policy and the state-action value function. Further, in an experimental comparison with PIFA and REINFORCE, ATPG always outperforms both algorithms, taking orders of magnitude fewer iterations to converge on all but very simple problems.
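As a concrete illustration of the idea of localizing gradient estimates to action transitions, the following is a minimal Python sketch, not the published ATPG algorithm: for a linear softmax policy over discrete actions, likelihood-ratio gradient terms are accumulated only at timesteps where the sampled action changes, and the difference between the sampled discounted returns before and after the change stands in as a relative value estimate in place of a function-approximated Q. The episode format, the linear policy parameterization, and this particular relative-value estimate are assumptions made for the example.

import numpy as np

def softmax_policy(theta, s):
    """Linear softmax policy: theta has one weight row per discrete action."""
    prefs = theta @ s                          # action preferences
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def atpg_style_gradient(theta, episode, gamma=0.99):
    """Illustrative gradient estimate restricted to action transitions.

    `episode` is a list of (state, action, reward) tuples from one rollout.
    Gradient terms are accumulated only at timesteps where the action differs
    from the previous one, and the relative value of the new action is taken
    from the difference of directly sampled discounted returns (no FA).
    """
    # Discounted return from each timestep: a direct sample of Q(s_t, a_t).
    returns = np.zeros(len(episode))
    G = 0.0
    for t in reversed(range(len(episode))):
        G = episode[t][2] + gamma * G
        returns[t] = G

    grad = np.zeros_like(theta)
    for t in range(1, len(episode)):
        s, a, _ = episode[t]
        if a == episode[t - 1][1]:
            continue                           # no action transition: no gradient term
        relative_value = returns[t] - returns[t - 1]
        p = softmax_policy(theta, s)
        score = -np.outer(p, s)                # grad of log pi(a|s) for the linear softmax
        score[a] += s
        grad += relative_value * score
    return grad

Restricting the sum to action transitions is the only ingredient taken from the abstract above; everything else is standard likelihood-ratio machinery.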
Similar Articles
Bayesian Policy Gradient and Actor-Critic Algorithms
Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Many conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. The policy is improved by adjusting the parameters in the direction of the gradient estimate. Since Monte-Carlo methods tend to have high variance, a large num...
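For reference, the Monte-Carlo (likelihood-ratio) performance gradient estimate referred to here has the standard form below, with N sampled trajectories \tau_i, policy \pi_\theta, step size \alpha, and trajectory return R(\tau_i); the notation is the commonly used one, not necessarily the cited paper's:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i - 1} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, R(\tau_i), \qquad \theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta).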
Efficient Sample Reuse in Policy Gradients with Parameter-Based Exploration
The policy gradient approach is a flexible and powerful reinforcement learning method particularly for problems with continuous actions such as robot control. A common challenge is how to reduce the variance of policy gradient estimates for reliable policy updates. In this letter, we combine the following three ideas and give a highly effective policy gradient method: (1) policy gradients with ...
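A minimal sketch of the parameter-based-exploration idea named in the title, under stated assumptions: exploration noise is placed on the policy parameters rather than on the actions, parameters are drawn from a Gaussian hyper-distribution, and the mean and standard deviation of that distribution are updated by a likelihood-ratio gradient of the sampled returns. The function rollout_return and the hyperparameter values are hypothetical and supplied by the caller.

import numpy as np

def parameter_exploration_update(mu, sigma, rollout_return, n_samples=20, lr=0.05, rng=None):
    """One parameter-based-exploration update: perturb parameters, not actions.

    `rollout_return(theta)` runs one episode with fixed policy parameters
    `theta` and returns its total reward (assumed to be provided by the caller).
    """
    rng = rng or np.random.default_rng(0)
    thetas = rng.normal(mu, sigma, size=(n_samples, mu.size))
    returns = np.array([rollout_return(th) for th in thetas])
    advantages = returns - returns.mean()      # baseline-subtracted returns
    # Likelihood-ratio gradients of the Gaussian hyper-distribution.
    grad_mu = ((thetas - mu) / sigma**2 * advantages[:, None]).mean(axis=0)
    grad_sigma = (((thetas - mu) ** 2 - sigma**2) / sigma**3 * advantages[:, None]).mean(axis=0)
    return mu + lr * grad_mu, sigma + lr * grad_sigma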
Expected Policy Gradients for Reinforcement Learning
We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first derive a practical result for Gaussi...
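For a discrete action space, the "integrate across actions" idea can be sketched as follows; the linear softmax policy and the externally supplied Q-estimates q_values are illustrative assumptions and do not reproduce the paper's continuous-action (Gaussian) result.

import numpy as np

def expected_policy_gradient(theta, s, q_values):
    """Gradient contribution at state `s`, summed over all actions.

    Each action's score function is weighted by its probability and by an
    estimate q_values[a] of Q(s, a), instead of using only the action that
    happened to be sampled, which removes the variance due to action sampling.
    """
    prefs = theta @ s
    p = np.exp(prefs - prefs.max())
    p /= p.sum()
    grad = np.zeros_like(theta)
    for a in range(theta.shape[0]):
        score = -np.outer(p, s)                # grad of log pi(a|s) for a linear softmax
        score[a] += s
        grad += p[a] * q_values[a] * score
    return grad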
Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms*
Gradient-based policy optimization algorithms suffer from high gradient variance; this is usually the result of using Monte Carlo estimates of the Q-value function in the gradient calculation. By replacing this estimate with a function approximator on state-action space, the gradient variance can be reduced significantly. In this paper we present a method for the training of a Gaussian Process t...
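A minimal sketch of the variance-reduction idea described above, using scikit-learn's GaussianProcessRegressor as a stand-in for the paper's Gaussian Process training procedure; the feature construction and kernel choice are assumptions made for the example.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_gp_q(states, actions, returns):
    """Fit a GP on state-action features to observed Monte Carlo returns.

    `states` and `actions` are 2-D arrays with one row per visited (s, a) pair;
    `returns` holds the sampled discounted return from each pair.
    """
    X = np.hstack([states, actions])
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
    gp.fit(X, returns)
    return gp

def gp_q_estimate(gp, state, action):
    """Posterior mean of the GP, used in place of the Monte Carlo Q sample."""
    x = np.concatenate([state, action])[None, :]
    return float(gp.predict(x)[0])

The gradient calculation then uses gp_q_estimate(gp, s, a) wherever a raw sampled return would otherwise appear.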